Fix stdio client shutdown bugs and rebuild the stdio test suite#2773
Fix stdio client shutdown bugs and rebuild the stdio test suite#2773maxisbey wants to merge 16 commits into
Conversation
…ignals
The stdio cleanup tests asserted upper bounds on elapsed wall-clock time
(cleanup took < 6.0s, < 5.0s, < 3.0s, between 1.5s and 4.5s) and paced
themselves with fixed sleeps ("let the process start"). Upper bounds on
elapsed time fail on slow runners — spawning an interpreter on a loaded
CI box can eat the whole margin — and the sleeps raced process startup.
Every cleanup test now uses the socket-based liveness probe the
process-tree tests already had: the child connects back to a listener
owned by the test (deterministic readiness, replacing the sleeps; setup
lines such as installing a signal handler are guaranteed to have run
before the connect), and the kernel closing that socket when the
process dies is the proof of termination (replacing the elapsed-time
upper bounds). Time bounds that remain are generous fail_after hang
guards, not performance claims, plus one deliberate lower bound:
test_stdio_client_stdin_close_ignored still asserts cleanup takes at
least PROCESS_TERMINATION_TIMEOUT, which pins the genuine time-based
contract that escalation must wait out the stdin-close grace period.
Lower bounds cannot fail from runner slowness.
Details worth knowing:
- test_stdio_client_sigint_only_process is renamed to
test_stdio_client_sigterm_ignoring_process: the cleanup path never
sends SIGINT (it escalates stdin-close -> SIGTERM -> SIGKILL), so the
old child's SIGINT handler never fired and SIGKILL was what actually
killed it. The test now states what it proves. Its whole-function
"pragma: lax no cover" stays, with the real reason documented:
coverage is enforced per CI job at 100% including on Windows runners,
where the test is skipped and its body would count as uncovered.
- test_stdio_client_graceful_stdin_exit now proves "no signal was
needed" directly: the child sends a marker over the socket only on
the stdin-EOF exit path before exiting on its own. The old upper
bound (< 3.0s) could not distinguish a stdin-driven exit from a
SIGTERM-driven one, which also completes in about 2.2s.
- TestChildProcessCleanup's three tests become top-level functions per
the no-test-classes rule; bodies unchanged.
- The defensive except (TimeoutError, Exception) block and the
pragma'd cancelled_caught/pytest.fail arms are gone: hang guards are
plain anyio.fail_after, whose TimeoutError fails the test with no
dead lines to exclude from coverage.
- PROCESS_TERMINATION_TIMEOUT is intentionally not monkeypatched
anywhere: the 2s grace period is the contract under test in the
lower-bound test, and everywhere else the tests no longer measure
time at all, so shrinking it would only save ~6s of real escalation
waiting at the cost of testing a constant production never uses.
9226789 to
e045ebb
Compare
64b8859 to
5067c98
Compare
…spurious kills Four user-facing bugs in the stdio client's shutdown path, all rooted in the same design problem — the shutdown sequence depended on pipe state and was not protected from cancellation: - Cancelling stdio_client (a client timeout, app shutdown) skipped the entire shutdown sequence: the first await in the finally block re-raised the pending cancellation, so stdin was never closed gracefully and the process tree was never terminated. Control then fell through to anyio's Process.aclose, whose shielded wait only returns once every pipe has closed - and a pipe inherited by a grandchild (npx- and uv-style servers always have one) never closes, so the client deadlocked forever. The shutdown now runs inside a shielded cancel scope with every wait time-bounded, and never relies on a pipe-gated wait. - The grace-period check used process.wait(), which on the asyncio backend resolves only when the process has exited AND its pipes have closed. A well-behaved server that exited instantly on stdin closure but left a background child holding stdout was misclassified as hung, burned the full grace period, and got its tree terminated with a spurious warning. The wait now polls returncode, which reflects process death alone. - Process-tree termination derived the group ID with os.getpgid(pid), which fails once the leader has been reaped even while its descendants are alive — and the fallback then "terminated" the dead leader, leaking every descendant. Since the process is spawned with start_new_session=True, the pgid is by definition the leader's pid; use it directly, and treat ProcessLookupError from killpg as "group already gone" rather than a reason to fall back. - Writing to a dead server's stdin surfaced a raw backend exception (ConnectionResetError inside an exception group) to the caller instead of a clean closed-stream signal. stdin_writer now treats BrokenResourceError and ConnectionError like ClosedResourceError. Windows fixes, by analysis (CI must validate): FallbackProcess.wait() ran Popen.wait() in a non-cancellable worker thread, so the timeout around the grace period could never fire and shutdown could hang indefinitely — it now polls cancellably, and returncode reflects death without requiring a wait. terminate_windows_process_tree dropped its timeout_seconds parameter, which was documented but never used (Job Object termination is immediate; the docstring now says so). Cleanup in the same files: the escalation now lives in one function with two named timeouts instead of three hardcoded 2.0s; the Job Object is tracked in a WeakKeyDictionary instead of a monkey-patched private attribute on anyio's Process; the deprecated terminate_windows_process is removed (migration.md updated); the always-true CREATE_NO_WINDOW hasattr dance and a retry path that double-spawned on spawn failure are gone; the client's JSON parse error handler catches ValueError instead of Exception.
… set
The stdio client's logic — framing, parse errors, the shutdown
escalation decisions — is now tested in process against a fake process
injected through the spawn seam, and only the properties that are OS
behaviour keep real subprocesses. The in-process tests include
regressions for the shutdown bugs fixed in the previous commit:
cancellation must still run the full shutdown, tree kill must reach
children after the leader has exited, writes racing server death must
surface clean closure, exiting with a server message still undelivered
must be a clean exit, and a server exiting on stdin close must never
be terminated. Each was verified to fail against the previous
implementation.
The real-process set shrinks to: one consolidated tree-kill test driven
through the public stdio_client (a parent that exits instantly on
SIGTERM, a child, and a grandchild — pinning group inheritance, atomic
group kill, and dead-leader robustness in one spawn), the
SIGTERM-ignoring SIGKILL escalation test, a dead-group no-op test, and
exec-failure ENOENT. Grace and kill timeouts are shortened via
monkeypatch in the real tests: the escalation decision is pinned in
process against a shortened grace, the SIGTERM-then-SIGKILL escalation
itself is exercised only by the real escalation test, and the
production constants' values are deliberately unpinned. The stdio
surface drops from ~11s to ~3.5s.
Deleted as subsumed or broken:
- The two tee-based tests: the interaction suite's subprocess e2e pins
the real-pipe round trip with a real server, and tee could never
deterministically exercise the chunk-reassembly buffer that the new
in-process framing test pins (real pipes deliver short messages as
whole lines).
- test_stdio_client_bad_path: its script only "failed" because
non-existent-file.py happens to be a Python expression that raises
NameError; the property (stdout EOF fails initialize with
CONNECTION_CLOSED) is now pinned in process.
- The basic/early-parent cleanup tests and universal-cleanup /
graceful-stdin-exit tests, folded into the consolidated tree test and
the in-process graceful/escalation tests.
- tests/issues/test_1027_win_unreachable_cleanup.py: its lifespan
cleanup chain is covered by the interaction e2e's clean-exit marker
plus the lifespan tests, it polled marker files with fixed sleeps,
and it leaked its child process when an assertion failed mid-test.
Its only helper module (tests/shared/test_win32_utils.py) goes with
it. The MCPServer.run("stdio") coverage its subprocess scripts
provided is replaced by an in-process test that drives run() over
injected stdio and asserts a real response.
tests/issues/test_552_windows_hang.py now asserts initialize succeeded
instead of swallowing every exception (its handcrafted response also
sent the invalid protocolVersion "1.0", so it errored on every run and
passed anyway), the stdio server tests are bounded by fail_after, and
the elicitation tests that claimed "stdio" in their names run, and now
say they run, over the in-memory transport.
5067c98 to
3035984
Compare
New tests/transports/stdio/ suite pinning the client transport's process-lifecycle contracts against real subprocesses, synchronized through TCP liveness sockets (kernel FIN/RST and connect-back) with no sleeps and no timing assertions: - Both platforms: a well-behaved server exits on stdin close and is reaped without escalation; cancellation mid-session still terminates the whole server tree; a server that exits mid-session keeps its own exit code; server stderr reaches the errlog file; FallbackProcess reports death through returncode without wait() and its wait() is cancellable (the SelectorEventLoop fallback wraps a plain Popen, so these run on every platform with waitid where available). - POSIX: a gracefully-exited server's background child survives client shutdown (the documented policy), proven by an echo round trip plus a never-escalated seam; the surviving child's next write to its inherited stdout fails with EPIPE once the client has released its pipe ends. - Windows: the same scenario's documented opposite outcome - the child is reaped when the Job Object handle closes at shutdown; a SelectorEventLoop session engages the FallbackProcess wrapper and completes a clean lifecycle; a print()-based server emitting CRLF line endings round-trips messages. Shared helpers live in _liveness.py (extracted from the real-process section of tests/client/test_stdio.py; consolidating that file onto this module is follow-up work), with conftest fixtures recording the spawn and terminate seams and reaping spawn-time process groups on teardown.
Source and docs (comment/docstring-level; one dead-code removal): - Correct the shutdown shield comment: a native task.cancel() delivered while cleanup is in progress can abort it regardless of count; the consumed-at-the-yield arithmetic only covered cancellation that initiated teardown. - Scope the pipe-gated process.wait() claims to asyncio on Python 3.11+ (3.10 and trio resolve on process exit alone), in the _wait_for_process_exit docstring and the migration guide. - Add sunset notes to the cpython#106749 tracer workarounds (3.11-only; revert when 3.11 support is dropped). - Document stdio_client's raised exceptions (Raises: section). - Document the Job Object pre-assignment window in create_windows_process and at the assignment call site; document the _process_jobs WeakKeyDictionary's two load-bearing invariants (PyHANDLE values, identity-hashed weakref keys); fix a wrong comment about the CloseHandle failure path in _maybe_assign_process_to_job. - Drop FallbackProcess.stdin_raw/stdout_raw (write-only attributes). Tests: - Fix the inert leak canary in the OSError spawn-failure test: the pytest.raises ExceptionInfo pinned the leaked streams across gc.collect(); drop it before collecting (mutation-verified). - Correct the EPERM escalation test's pre-fix description (the old code fell back to a leader-only kill, leaking other group members - it did not hang forever) and trim an unasserted claim from the write-after-death docstring. - Add a CRLF-framed message to the chunk-reframing test, pinning the trailing-CR tolerance Windows servers rely on, on every platform. - Pin the grace wait's returncode read (what reaps the leader's zombie on trio) with a read-counting stub, parametrized over asyncio and trio; deleting the read fails the test on both backends. - Add the missing width-justification comment to one fail_after(10.0).
The job-handle-close reap test failed on all windows-latest legs with a silent 15s accept timeout: the grandchild's connect-back never arrived, and xdist swallows the server's stderr, so the logs could not say which link of the spawn chain broke. Reshape the server to the form the passing cancellation test already proves on Windows CI: it connects back to the liveness listener itself after spawning the child, then parks on stdin. The test now accepts two connections, so a failure distinguishes "server never ran its connect line" from "child never connected". Capture the server's stderr via errlog into a temp file and attach it to every failure message, and wrap the child spawn in a try/except that reports to stderr before re-raising. The contract is unchanged: both sockets must close after a graceful exit (server FIN, child killed by the job-handle close), with returncode 0 and the escalation seam untouched.
The job-reap test still fails on windows-latest with an empty captured stderr, which proves the server ran its whole script but cannot say anything about the child. Close the remaining blind spots: - Route the child's stderr into the server's (which errlog captures), so a child that dies at startup leaves its traceback in the failure message instead of vanishing into the hidden console. - Print a child-started marker to stderr first, splitting "child never spawned" from "child started but could not connect". - Report how many of the two liveness connections arrived and which leg is missing, instead of one undifferentiated timeout. - Record when the stdio_client context was entered and include the spawn-to-entry split in the failure message, so a stalled spawn is distinguishable from a stalled child. Verified on POSIX by running the same choreography with a deliberately unreachable child port: the failure message names the missing leg and quotes the child's ConnectionRefusedError traceback verbatim.
The instrumented CI runs showed the grandchild booted (its startup marker reached stderr on some legs) and then died tracelessly before it could connect — no traceback, no SDK escalation, no job-handle close. The remaining kill mechanism in the chain is the venv launcher layer: sys.executable in a uv-managed venv is a trampoline that runs the real interpreter inside its own kill-on-close Job, and that private job machinery proved fatal to grandchildren on the CI runners. The server now spawns its child through sys._base_executable, removing the foreign launcher layer from the grandchild chain. The contract under test is unchanged: the child still inherits the SDK's Job Object at CreateProcess and must die when the job handle closes at shutdown. Also report the child's poll() status after stdin EOF ends the server (child-rc on stderr, zero cost on the healthy path), so any remaining failure message names the child's exit status - the discriminator between "still alive but unreachable" and "killed, by this code".
The grandchild was freezing at interpreter startup for as long as the server sat in its blocking stdin read: CPython queries fd 0 during startup, the child's inherited stdin was the server's stdin pipe (a synchronous file object), and Windows serializes that query behind the server's pending ReadFile. The child only thawed when the client closed stdin at shutdown - at which point the job-handle close correctly reaped it before it could connect. Spawning the child with stdin=DEVNULL removes the coupling, so it boots immediately and the kill-on-job-close contract can be observed. Also reverts the previous commit's base-interpreter bypass: the venv launcher was not involved (the freeze reproduces identically with the base interpreter). The same stdin coupling applies to any Windows stdio server that spawns Python children without redirecting their stdin.
After shutdown closes the child's stdin, the child has to unwind its run loop, write its clean-exit stderr line, and let coverage's atexit hook persist the subprocess data file before the stdin-close grace expires. On a slow Windows runner the production 2s grace was too tight: the escalation killed the child mid-atexit - after the asserted stderr line, so the test stayed green - and the silently missing data file failed the 100% coverage gate at 99.95%. Widen the grace to 10s for this one test; the timeout is not under test here.
Restructure stdio_client to read top-down: spawn the server process before creating the transport streams (a failed spawn now has nothing to clean up), extract _parse_line, replace the reader's delivering flag with two sequential phases (deliver, then drain), name the shutdown sequence as a shutdown() helper with numbered steps, and rename _shutdown_process_tree to _stop_server_process so it no longer reads as a synonym of _terminate_process_tree. Add a ServerProcess alias for the process union. Use the same cancellation idiom as the other client transports for the bounded waits and task teardown (move_on_after, tg.cancel_scope.cancel()) instead of deadline-poll loops and dedicated per-task scopes; a cancel_shielded_checkpoint after each site that can throw through the await chain keeps coverage accurate on Python 3.11 (python/cpython#106749). Drop four scheduling checkpoints in exception handlers that nothing depends on. Compress comments to the load-bearing rationale, anchor the shutdown sequence to the MCP lifecycle specification, and explain why function-valued ("()"-prefixed) environment variables are skipped.
- Close the read stream before the writer-flush wait: the reader's stdout drain is what unwedges a server blocked writing, so a wedged writer's flush can now complete instead of burning the flush window (on trio the aborted send also truncated an accepted message mid-line). - Run the reader's drain phase shielded and on raw bytes: caller cancellation no longer skips the drain (which got well-behaved servers spuriously tree-killed), and a dying server flushing non-UTF-8 output can no longer abort shutdown with a decode error. - Re-check returncode when the grace wait times out, so a server dying in the final poll interval is not spuriously escalated. - Release the subprocess transport by duck type, not isinstance: uvloop's transport is not an asyncio.SubprocessTransport, so the deterministic fd release silently no-op'd there. Tolerate PermissionError from the close (Python <= 3.12 re-kills a setuid child without catching it). - Log killpg EPERM at warning, not error: macOS raises EPERM for groups containing only unreaped zombies, i.e. a cleanly exited server. - Test teardown tolerates EPERM from killpg (same macOS zombie case), and the interaction test's outer deadline now exceeds the grace period it configures. New regression tests pin each behavioral fix. - Document the terminate_posix_process_tree group-leader requirement and the win32 utilities logger rename in the migration guide.
There was a problem hiding this comment.
I didn't find any bugs in this change, but it rewrites the stdio client's shutdown path (cancellation shielding, process-tree termination, Windows Job Object handling) and includes documented breaking behavior changes, so it needs a human reviewer's judgment rather than auto-approval.
Extended reasoning...
Overview
This PR rewrites the shutdown sequence of stdio_client in src/mcp/client/stdio.py, overhauls mcp/os/posix/utilities.py (process-group SIGTERM/SIGKILL escalation, EPERM handling) and mcp/os/win32/utilities.py (Job Object tracking via a weak-keyed map, deterministic handle close, removal of the deprecated terminate_windows_process), updates docs/migration.md, and rebuilds the stdio test suite around an in-process FakeProcess plus a new real-subprocess lifecycle suite under tests/transports/stdio/.
Security risks
No direct security exposure (no auth, crypto, or input-handling surfaces). The risk profile is reliability-shaped: process lifecycle management, signal/Job Object semantics, cancellation shielding, and resource cleanup. A subtle bug here would manifest as leaked or orphaned server processes, hung shutdowns, or platform-specific (Windows/macOS/trio) regressions rather than a vulnerability.
Level of scrutiny
High. The transport shutdown path is production-critical for every stdio-based MCP client, the change is large and concurrency-heavy (cancel scopes, shields, backend differences between asyncio 3.10/3.11+ and trio), it removes a public deprecated API, and it intentionally changes behavior (POSIX no longer kills a gracefully-exited server's surviving children; Windows reaps job survivors deterministically). That child-survival policy change in particular is a design decision a maintainer should explicitly sign off on, not something an automated approval should accept.
Other factors
The automated bug-hunting pass found no issues, the PR description documents per-fix regression tests with claimed 100% coverage and extensive stress runs, and the maintainer's earlier inline comments (docstring style, the timing-based test, the deleted #1027 regression test) appear to have been addressed with follow-up commits. Those are positive signals, but given the scope, platform-specific behavior, and breaking changes, this falls well outside the simple/mechanical category eligible for shadow approval.
- Wait for process exit with a plain deadline-polling loop instead of a cancel scope: the timeout path no longer throws through the await chain, which removes the coverage-resync checkpoint it needed (python/cpython#106749) and collapses the double-exit/re-check shape into a single while loop - Extract the reader's shutdown drain (_drain_stdout), the tolerant pipe closes (_close_pipe, replacing two pragma-guarded inline blocks), the POSIX group-liveness probe (_group_alive), and Job Object handle closing (_close_job_handle) - Add a module docstring telling the shutdown story once and trim comments to the constraints they pin No behavior change intended; the full suite, coverage, and the stdio stress runs are unchanged.
stdio_client now carries the same AsyncGenerator[TransportStreams, None] annotation as streamable_http_client, so it formally satisfies the Transport protocol instead of leaking anyio's concrete memory-stream types. Also adds the missing -> None on FallbackProcess.__init__ and retypes the test helper to the ReadStream protocol.
Adopt a uniform comment standard for the stdio transport and its tests: PEP 257 one-line summaries, docstring bodies only for contracts or for warning-of-consequences notes (why the obvious alternative is wrong), inline comments at most two lines, no Args/Returns boilerplate on private helpers, ASCII punctuation only. Facts guarding non-obvious OS behavior (killpg/EPERM semantics, the pipe-gated process.wait(), the Windows stdin-pipe interpreter freeze, CPython gh-106749) are kept in compressed form; pure narration is deleted. No code changes outside comments and docstrings.
Every docstring now opens with a single-line summary sentence followed by a blank line and the body. Restores the useful content that an earlier slimming pass cut from create_windows_process: the SelectorEventLoop / NotImplementedError fallback explanation, the Job Object purpose, the assignment-window caveat, and a Returns section explaining the Process | FallbackProcess union. Docstring-only change; no code or behavior changes.
There was a problem hiding this comment.
I didn't find any bugs, but this is a large PR rewriting the stdio client's shutdown path (process-group/Job Object termination, cancellation shielding) with documented breaking changes, so it needs a careful human review rather than an automated approval.
Extended reasoning...
Overview
The PR rewrites the stdio client transport's shutdown sequence in src/mcp/client/stdio.py, the POSIX process-group termination in src/mcp/os/posix/utilities.py, and the Windows Job Object handling in src/mcp/os/win32/utilities.py, and rebuilds the entire stdio test surface (a new fake-process unit suite, a new real-subprocess lifecycle suite under tests/transports/stdio/, plus deletion/rewrite of two issue-regression tests). It also documents two breaking changes in docs/migration.md: removal of the deprecated terminate_windows_process API and a behavioural change where a gracefully-exited server's surviving children are no longer killed on POSIX.
Security risks
No direct security exposure (no auth, crypto, or input-parsing surfaces are touched), but the code does signal handling, killpg on process groups, and Windows Job Object termination — mistakes here manifest as orphaned/leaked server processes or hung shutdowns rather than as injection or data-exposure issues. The EPERM handling around killpg and the new policy of leaving a graceful server's children alive are deliberate behaviour decisions worth a maintainer's sign-off rather than risks per se.
Level of scrutiny
High. This is core, production-critical transport code on every stdio_client user's shutdown path, with platform-divergent semantics (asyncio vs trio, Python 3.10 vs 3.11+, POSIX vs Windows fallback) and intentional breaking changes. The cancellation-shielded shutdown, the returncode-polling exit wait, and the private-attribute walk to close the asyncio subprocess transport are all subtle and not the kind of change a bot should approve without a human weighing the design (especially the child-survival policy change, which is a user-visible behaviour decision, not just a bug fix).
Other factors
The automated bug-hunting pass over the diff surfaced no bugs, the PR claims 100% line+branch coverage with regression tests for each fix, and a human reviewer is already engaged (their inline comments appear addressed). My deferral is about scope and design judgement, not about specific defects found.
The stdio test suite has a long flake history (#1775, #1158, #559, #401). Auditing it surfaced real bugs in the client transport's shutdown path. This PR fixes the transport and rebuilds the stdio tests so they are deterministic.
Motivation and Context
Shutdown bugs fixed (each has a regression test that fails on
main):stdio_clientcould deadlock shutdown forever.Process.aclose(), which waits for all pipes to close — and a server with a child of its own (npx ...,uv run ...) holds the pipes open indefinitely.os.getpgid()raises once the leader is reaped, leaking every surviving child. The group is now killed via the spawn-time pgid.process.wait()is pipe-gated on asyncio 3.11+, so a server that exited promptly but left a background child running was misclassified as hung and its whole tree killed. The wait now keys onreturncode, which reflects process death alone.wait()could not be cancelled.ConnectionResetError; reading leakedBrokenResourceError. Both now surface as clean session closure. Fixes BrokenResourceError race condition in stdio_client cleanup when context exits quickly #1960.ResourceWarninginto whatever ran next.The whole sequence now lives in one cancellation-shielded function with named timeouts: close stdin → grace period keyed on process death → terminate tree → bounded wait → kill tree. See
docs/migration.md.Test rebuild:
tests/transports/stdio/suite pins the end-to-end lifecycle on both platforms: clean exit without escalation, cancellation kills the whole tree, stderr routing, and the child-survival policy.tee-based tests andtest_1027_win_unreachable_cleanup.py(their properties are re-pinned above).test_552_windows_hang.pynow asserts real outcomes instead of swallowing exceptions.How Has This Been Tested?
main, and the OS-level claims were verified empirically on asyncio and trioBreaking Changes
terminate_windows_process(deprecated) is removed;terminate_windows_process_treeno longer takes the unusedtimeout_seconds.Details in
docs/migration.md.Types of changes
Checklist
Additional context
Two known limitations are documented in code rather than fixed:
task.cancel()delivered mid-cleanup can still abort it (no backend-neutral way to refuse native cancellation).CREATE_SUSPENDEDspawn is the follow-up if it bites).AI Disclaimer